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Due to the time-scale limitations of all-atom simulation of proteins, there has been substantial 
interest in coarse-grained approaches. Some methods, like "Resolution Exchange," [E. Lyman et al, 
Phys. Rev. Lett. 96, 028105 (2006)] can accelerate canonical all-atom sampling, but require properly 
distributed coarse ensembles. We therefore demonstrate that full sampling can indeed be achieved 
in a sufficiently simplified protein model, as verified by a recently developed convergence analysis. 
The model accounts for protein backbone geometry in that rigid peptide planes rotate according to 
atomistically defined dihedral angles, but there are only two degrees of freedom (0 and ip dihedrals) 
per residue. Our convergence analysis indicates that small proteins (up to 89 residues in our tests) 
can be simulated for more than 50 "structural decorrelation times" in less than a week on a single 
processor. We show that the fluctuation behavior is reasonable, as well as discussing applications, 
limitations, and extensions of the model. 



I. INTRODUCTION 

How simplified must a molecular model of a protein 
be for it to allow full canonical sampling? This question 
may be important to the solution of the protein sampling 
problem — the generation of protein structures properly 
distributed according to statistical mechanics — because 
of the well-known inadequacy of all-atom simulations, 
which arc limited to sub-microsecond timescales. Even 
small peptides have proven slow to reach convergence^. 
Sophisticated atomistic methods, moreover, which often 
employ elevated temperatures 2 -'^' 5 -' 6 -, have yet to show 
they can overcome the remaining gap in timescales^ — 
which is generally considered to be several orders of mag- 
nitude. On the other hand, because of the drastically re- 
duced numbers of degrees of freedom and smoother land- 
scapes, coarse-grained models (e.g., Refs. !3l9l iToliTlllT^ , 
[T^li^IlBIiBIi^Iiall^llOIMIIsMlSESSBll^) may have the 
potential to aid the ultimate solution to the sampling 
problem, particularly in light of recently developed al- 
gorithms like "Resolution Exchange" 27 ' 28 and related 
methods 2 ^ -^. 

Although the Resolution Exchange approach, in prin- 
ciple, can produce properly distributed atomistic ensem- 
bles of protein configurations, it requires full sampling at 
the coarse-grained leve l 27 ' 28 . While the potential for such 
full sampling has been suggested by some studies of fold- 
ing and conformational change (e.g., Refs. [26j.;32), con ~ 
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vergence has yet to be carefully quantified in equilibrium 
sampling of folded proteins. How much coarse-graining 
really is necessary? What is the precise computational 
cost of different approaches? This report begins to an- 
swer these questions by studying a united-residue model 
with realistic backbone geometry. 

We will require a quantitative method for assess- 
ing sampling. A number of approaches have been 
suggeste d 1 ' 33 ' 34 ' 35 ' 36 , but we rely on a recently proposed 
statistical approach which directly probes the funda- 
mental configuration-space distributioni 3 - 7 -. The method 
does not require knowledge of important configurational 
states or any parameter fitting. In essence, the ap- 
proach attempts to answer the most fundamental sta- 
tistical question, "What is the minimum time interval 
between snapshots so that a set of structures will be- 
have as if each member were drawn independently from 
the configuration-space distribution exhibited by the full 
trajectory?" This interval is termed the structural decor- 
relation time Tdec, and the goal is to generate simulations 
of length t sim > T dcc . 

In this report, we demonstrate the convergence of the 
equilibrium ensemble for several proteins using a fast, 
united-residue model employing rigid peptide planes. 
The relative motion of the planes is determined by the 
atomistic geometry embodied in the <f> and ip dihedral- 
angle rotations, as explained below. We believe such re- 
alistic backbone geometry will be necessary for success 
in Resolution Exchange studies. The use of geometric 
look-up tables enables the rapid use of only two degrees 
of freedom per residue {<p and tp), and one interaction 
site at the alpha-carbon. The simulations are therefore 
extremely fast. Go interactions stabilize the native state 
while permitting substantial fluctuations in the overall 
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FIG. 1: The rigid peptide plane model used this study. Note 
that, in the coarse-grained simulations, only alpha-carbons 
are represented, and the only degrees of freedom are <j> and 
tp. Other atoms are shown in the figure only to clarify the 
geometry and our assumption of rigid peptide planes. 

backbone. 

After the model and the simulation approach are ex- 
plained, the fluctuations are compared with experimental 
data from X-ray temperature factors and the diversity of 
NMR structure sets. The simulations are then analyzed 
for convergence and timing. 

II. COARSE-GRAINED MODEL 

The coarse-grained model used for this study was cho- 
sen to meet several criteria: (i) the fewest number of 
degrees of freedom per residue; (ii) the ability to utilize 
lookup tables for enhanced simulation speed; (iii) the sta- 
bility of the native state along with the potential for sub- 
stantial non-native fluctuations; and, (iv) the ability to 
allow the addition of chemical detail, as simply as pos- 
sible. Thus, we chose a rigid peptide plane model with 
Go interactions 9 - ' 38 ' 39 and sterics based on alpha-carbon 
interaction sites as shown in Fig. [TJ The use of such a 
simple model, we emphasize, is consistent with our goal 
of understanding both the potential, and the limitations 
of coarse models for statistically valid sampling. Once 
we have understood the costs associated with the present 
model, we can design more realistic models, as discussed 
below. In other words, we made no attempt to design 
the most chemically realistic coarse-grained model, al- 
though we believe the use of atomistic peptide geometry 
is an improvement over a coarse model we considered 
previously 26 . 

The rigid peptide planes allows the use of only two de- 
grees of freedom per residue, arguably the fewest that one 
would consider in such a model. Indeed, this is fewer than 
in a freely rotating chain, although admittedly our model 
requires somewhat more complex simulation moves, de- 
scribed below. 

Go interactions were used because they simultaneously 
stabilize the native state of the protein and also permit 



reasonable equilibrium fluctuations, as was shown in an 
earlier study-2 6 -. Given our interest in native-state fluctu- 
ations and the lack of a universal coarse-grained model 
capable of stabilizing the native state for any protein, 
Go interactions are a natural choice for enforcing sta- 
bility. Further, beyond the reasonable "local" fluctua- 
tions shown below, the model also exhibits partial un- 
folding events which are expected both theoretically and 
experimentall y 40 ' 41 1 42 . 

Because we see the present model as only a first step in 
the development of better models, it is important that it 
easily allows for the addition of chemical detail, such as 
Ramachandran propensities which require only the dihe- 
dral angles we use explicitly^ 3 -. Furthermore, with a rigid 
peptide plane, the locations of all backbone atoms — and 
the beta carbon — are known implicitly. Thus hydrogen- 
bonding and hydrophobic interactions 1 ^ can be included 
in the model with little effort. In other words, the "ex- 
tendibility" of the present simple model was a significant 
factor in its design. 

A. Potential energy of model system 

The total potential used in the model is given by 

U = t/ nat + C/ non , (1) 

where [/ nat is the total energy for native contacts, and 
rjnon j g total energy for non-native contacts. 

For the Go interactions, all residues that are separated 
by a distance less than i? cu t in the experimental structure 
are given native interaction energies defined by a square 
well: 

native 

^nat = J2 Zi nat (r y ), 
{i<0} 

( oo if m < rff(l-S) 
u nat {n 3 ) = I -e if rff(l — S)< n, < rff(l + 6) , 
[ otherwise 

(2) 

where ry is the C a — C a distance between residue i and 
7, r" at is the the distance between the residues in the 
experimental structure, e determines the energy scale of 
the native Go attraction, and 5 is a parameter to choose 
the width of the well. All residues that are separated by 
more than R cu t in the experimental structure are given 
non-native interaction energies defined by 

non— native 
^non = U™ n ( nj ), 

{i,3} 

foe if ry < (pi + pj)(l - S) 
+he if (p i + p j )(l-8)<r ij <R ca t , 
otherwise 

(3) 
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where pi is the hard-core radius of residue i defined as half 
the C a distance to the nearest non-covalently-bondcd 
residue, and h determines the strength of the repulsive 
interaction. 

For this study, parameters were chosen to be similar 
to those in Ref. [2y, i.e., e = 1.0, h = 0.3, S — 0.1, and 
R cut = 8.0 A. 



B. Monte Carlo simulation 

The protein fluctuations were generated using 
Metropolis Monte Carlo^. Trial configurations were gen- 
erated by adding a random Gaussian deviate to the val- 
ues of three sequential pairs of backbone torsions (three 4> 
and three angles). We found that changing six sequen- 
tial backbone torsions maximizes the rate of convergence 
of the equilibrium ensemble (data not shown). The en- 
ergy of the trial configuration was then determined using 
Eq. |T]), and the conformation was accepted with proba- 
bility min(l, e~ Au ^ ksT ), wnere £jj j s th e total change in 
potential energy of the system. The width of the Gaus- 
sian distribution for generating random deviates was cho- 
sen such that the acceptance ratio was about 40% for all 
simulations. The choice of temperature is discussed be- 
low. 



C. Use of lookup tables 

The speed of the coarse-grained simulation was en- 
hanced by using lookup tables to avoid unnecessary com- 
putation. In general, utilizing lookup tables increases 
memory usage while decreasing the number of computa- 
tions. Since memory is inexpensive and can be expanded 
easily, utilizing as much memory as possible can be an ef- 
fective technique for increasing the speed of simulations. 

In our model there are only two degrees of freedom per 
residue (<f>,ip), but C a distances ry must be computed 
to determine native and non-native interaction energies 
given by Eqs. ^ and (J3J). All peptide planes are con- 
sidered to possess ideal, rigid geometry as determined by 
energy minimization of the all-atom OPLS forcefield 45 
using the tinker simulation package^. 

Given a sequence of three residues (alpha carbons), 
we employed a lookup table to provide the Cartesian 
coordinates of the third residue — starting from the N- 
tcrminus — and its normal vector as a function of 4> and 
ijj; see Fig. [T] The table values assume that the first 
residue is at the origin and the second residue is located 
on the z-axis. Once the coordinates for the third residue 
were determined via the lookup table, the fourth residue 
position was determined using the lookup table in con- 
junction with a coordinate rotation and shift. Continu- 
ing in this fashion, coordinates for the entire protein were 
determined. 

The resolution of the lookup table is an important con- 
sideration, i.e., the number of <j), ip values for which Carte- 



sian coordinates are stored. In our simulations, we tried 
resolutions as high as 0.1° and as low as 1.0°, and found 
no difference between the results. Thus, all simulation re- 
sults presented here use lookup tables with a resolution 
of 1.0°. 



D. Initial protein relaxation 

One perhaps unexpected complication of utilizing a 
rigid peptide plane model is that great care must be 
taken to relax the protein before simulations can be per- 
formed. Although initial values of <f>, ip are obtained from 
the X-ray or NMR structure, there are slight deviations 
from planar/ideal geometry in a real protein. These de- 
viations, while small, can accumulate rapidly to become 
very large differences in the Cartesian coordinate posi- 
tions of the residues. Thus, the positions of residues near 
the beginning of the protein will be nearly correct, while 
the residues near the end of the protein will likely have 
large errors — compared to the experimental structure be- 
ing modeled — which can create severe steric clashes or 
even incorrect protein topology. The severity of these 
"errors" necessitates the use of a relaxation procedure 
to generate a suitable starting structure — i.e., a set of 
<f> and ip angles which, with our ideal-geometry peptide 
planes, lead to a topologically reasonable and relatively 
clash-free structure. 

Before we detail our relaxation procedure, we note 
that the need for this additional calculation is an arti- 
fact of the simplicity of our model which can be over- 
come. With the use lookup tables, in fact, it is possible 
to include flexible peptide planes without significantly 
increasing the computational cost of the model. Such 
an approach, which does not require initial relaxation, is 
currently under investigation with promising preliminary 
results (data not shown). 

The relaxation procedure employed in the present 
study first uses the <f>, tp values directly obtained from 
the experimental structure. These dihedrals provide the 
initial (problematic) structure for a coarse-grained sim- 
ulation. Due to the deviations from planarity described 
above, the root means-square deviation (RMSD) between 
the initial structure we create and the experimental struc- 
ture tends to be large (~ 10 A was not uncommon for the 
proteins in this study). To increase the number of native 
contacts and reduce the number of steric clashes, we next 
performed what we term "RMSD Monte Carlo" to relax 
the protein to a low RMSD structure. Trial moves for 
RMSD Monte Carlo were created as described above, but 
accepted with probability min(l, e ~ A( - KmT> ^ kBTRMSI >), 
where fc^TRMSD = 10~ 7 was chosen so that moves to 
a higher RMSD were rare. In other words, the energy 
function itself was not used in this initial phase. 

Since residues near the beginning of the protein have 
less error in the starting structure than residues near the 
end, we used RMSD Monte Carlo in segments. The first 
twenty residues were relaxed until the RMSD was con- 
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stant within a tolerance of 0.0001 A, followed by the first 
forty, then the first sixty and so on until the RMSD of 
the entire protein was relaxed. The RMSD Monte Carlo 
simulation typically brought the RMSD of the simulated 
structure to less than 0.5 A, however, there were gen- 
erally still steric clashes, and some native contacts were 
still not present. 

The final stage of relaxation was to do regular (i.e., 
using energy) Metropolis Monte Carlo simulation, with 
a very low temperature. 

Relaxation was performed until four criteria were met: 
(i) the number of native contacts in the relaxed structure 
was equal to that in the NMR or X-ray structure; (ii) no 
steric clashes were present; (iii) no non-native contacts 
were present, i.e., U non = in Eq. (|3|), and; (iv) the 
RMSD was less than 1.0 A. When these criteria were 
met the structure was saved and used as the starting 
configuration in all future simulations of the protein. 



III. RESULTS AND DISCUSSION 

Using the coarse-grained protein model described 
above, we generated and tested equilibrium ensembles 
for three proteins: barstar (PDB entry 1A19, residues 
1-89), the N-terminal domain of calmodulin (PDB entry 
1CLL, residues 4-75), and the binding domain of protein 
G (PDB entry 1PGB, residues 1-56) 

For each protein, the initial simulation structure was 
generated, followed by RMSD and energy relaxation, as 
described in Sec. Ill Dl Then, production runs of 2 x 10 9 
Monte Carlo moves were performed with snapshots saved 
every 1000 moves, generating an equilibrium ensemble 
with 2 x 10 6 frames. 

In an attempt to obtain consistent results for the three 
proteins, we chose the temperature of the simulation, 
fc^T, to be slightly below the unfolding temperature of 
the protein. The unfolding temperature was determined 
by running simulations over a broad range of tempera- 
tures and studying the RMSD as a function of simulation 
time. The temperatures used in the simulations were 
ksT = 0.6 for barstar, fc^T = 0.4 for calmodulin and 
/cbT = 0.5 for protein G. 



A. Speed of simulations 

Due to the use of lookup tables for coordinate trans- 
formations, the small number of degrees of freedom, and 
utilizing simple square potentials, the equilibrium ensem- 
bles were generated very rapidly. 

Running on one Xeon 2.4 GHz processor, 2 x 10 9 Monte 
Carlo moves with snapshots saved every 1000 steps took 
roughly 6 days for barstar, 4 days for calmodulin, and 3 
days for protein G. Thus, less than a week was required 
to obtain well-converged (see Sec. MI C[) simulations of 
these coarse-grained proteins. 



B. Protein fluctuations 

We first sought to determine whether fluctuations in 
the coarse-grained simulation are reasonable. Figure f2] 
shows the alpha-carbon relative root mean square fluc- 
tuation for three different proteins. The figures show 
that there is reasonable qualitative agreement between 
the NMR, X-ray and simulation data. 

It should be noted that, in fact, none of the three data 
sets in Figs. [5^,, b and c represents the true fluctuations 
in the protein — for different reasons. The X-ray temper- 
ature factor, in addition to thermal fluctuations, includes 
crystal lattice artifacts and other experimental errors^. 
NMR ensembles tend to be biased, perhaps severely, to- 
ward low energy structures, and thus also do not rep- 
resent equilibrium ensembles^ 9 -. Finally, our simulation 
data is not accurate due to the lack of chemical detail in 
the forcefield. 

In spite of the limitations of the analysis, we conclude 
from Fig. [2] that the fluctuations of the coarse-grained 
model are in fact reasonable. 

The bottom panels of Fig. [2] show the whole- molecule 
fluctuations exhibited throughout the trajectories. In 
addition to the ability to sample large conformational 
fluctuations — such as in the case of calmodulin and, to a 
lesser degree, for protein G — the trajectories are visibly 
more converged than is typically observed in atomistic 
simulations, where RMSD values rarely reach a plateau 
value, let alone sampling around that plateau value mul- 
tiple times as would be desirable. 

C. Convergence analysis 

The primary purpose of this report is to demonstrate 
the convergence of the equilibrium ensemble for a coarse- 
grained protein. The details of the convergence analysis 
are described in Ref. H3, so we will only briefly describe 
the method here. 

Previously, Lyman and Zuckerman 1 developed an ap- 
proach which groups sampled conformations into struc- 
tural histogram bins, using the RMSD as a metric. While 
promising, the primary limitation of the method was the 
lack of a quantitative measure of the convergence. 

In the method used here, convergence was analyzed 
by studying the variance of the structural histogram 
bin populations 37 . The new approach allows a rigor- 
ous quantitative estimation of convergence — the struc- 
tural decorrelation time Td C c, given by the time between 
frames required for the variance to reach an analytically 
computable independent-sampling value. Intuitively, and 
mathematically, Td oc is the time interval between snap- 
shots for which they behave as if each frame were drawn 
independently. If simulation times i s im 3> ^dcc are ob- 
tained, the equilibrium ensemble is considered converged. 

Perhaps the most important feature of the convergence 
analysis for our study is that the method does not require 
any prior knowledge of important states. Furthermore, 
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FIG. 2: (Color online) Relative alpha-carbon root mean square fluctuations for three different proteins: (a) barstar, (b) 
calmodulin, and (c) protein G. Each plot shows results for the X-ray structure (dot-dash), the NMR ensemble (dash), and the 
coarse-grained simulation (solid). X-ray results were given by ^3B /8n 2 , where B is the temperature factor given in the PDB 
entry. NMR and simulation data were generated using the g_rmsf program in the GROMACS molecular simulation package-^; 
each ensemble was aligned to the first structure in the corresponding trajectory. For each coarse-grained simulation, 2 x fO 9 
Monte Carlo steps were performed with snapshots saved every 1000 steps, and the potential energy (fT) was set up using the 
X-ray structure. Panels (d) - (f) show the corresponding whole-structure fluctuations as indicated by the RMSD from the 
experimental structures. 



there is no parameter-fitting or subjective analysis of any 
kind. 

Figure [3] shows the convergence properties of the 
coarse-grained simulations using the same trajectories as 
in Fig. [2] The ratio of the observed variance to the ideal 
variance for independent sampling is plotted as a function 
of the time between the configurations used to compute 
the observed variance. When this ratio decreases to one 
the structural decorrelation time Td ec has been reached, 
as shown in the figure. The analysis predicts that each 
simulation is at least 50 times longer than the structural 
decorrelation time. 

Thus we conclude that, in less than a week of single- 
processor time, the equilibrium ensembles for these three 
proteins are well converged. 



IV. CONCLUSIONS 

We have demonstrated the convergence of the equilib- 
rium ensemble for a simple united-residue protein model. 
The model assumes rigid peptide planes, with atomisti- 
cally correct geometry, and exhibits reasonable residue- 
level fluctuations based the planes' geometry, Go inter- 
actions, and sterics. 



Most importantly, the results indicate quantitatively 
that carefully designed united-residue models have the 
potential to fully sample protein fluctuations. By us- 
ing only two degrees of freedom per residue, look up ta- 
bles for coordinate transforms, and simple square well 
potentials, we were able to demonstrate that converged 
equilibrium ensembles can be obtained in less than a 
week of single processor time. The quantitative conver- 
gence analysis indicates that more than 50 "decorrela- 
tion times" were simulated in each case, indicating high- 
precision ensembles. In addition to application in Reso- 
lution Exchange sampling of all-atom model a 27 i 28 , such 
speed opens up the long-term possibility of large-scale 
simulation of many proteins. 

One important practical limitation of the ideal- 
peptide-plane geometry in the present model is the need 
to relax the the initial structure. Proteins larger than 100 
residues are difficult to relax. However, we have already 
begun investigating a flexible-plane model incorporating 
lookup tables which exhibits no such limitation and re- 
mains computationally affordable. We will report on the 
flexible model in the future. 

Although the intrinsic atomistic geometry of the pep- 
tide plane was included in our model, it lacks chemical 
interactions. Yet because we obtained converged ensem- 
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FIG. 3: Convergence analysis for coarse-grained simulations of three different proteins: (a) barstar, (b) calmodulin, and (c) 
protein G. Each plot shows the convergence properties for the same trajectories as used for Fig. [2] analyzed using the procedure 
in Ref. The number of frames required to reach the value of one (the solid horizontal line) is an approximation of the 
structural decorrelation time Td cc and is shown on each plot. The three curves on each plot are results for different histogram 
sub-sample sizes^ 7 - and demonstrates the robustness of the value of race- The plots predict that the decorrelation times are 
roughly 40 000 frames for barstar, 20 000 frames for calmodulin and 30 000 frames for protein G. Note that the total number 
of frames generated for each protein during the simulation was 2 x 10 6 . Thus, since each simulation was more than 50rd C c in 
length, we conclude that the equilibrium ensembles are well-converged. Error bars represent 80% confidence intervals in the 
expected fluctuations around the ideal value of one, based on the given trajectory length and the numerical procedure used to 
generate the solid curve. 



bles in such a short time, it is clear we can "afford" exten- 
sions to the model which include realistic chemistry. For 
instance, additional potential energy terms such as Ra- 
machandran propensities^ 3 -, hydrophobic interactions* 5 - 
and hydrogen-bonding can be included at small cost. 

Aside from the potential for rigorous atomistic 
samplin g 27 ' 28 ' 50 , it is important to note the general use- 
fulness of coarse-grained models for generating ad hoc 
atomistic ensembles. Specifically, upon generating a well- 
sampled ensemble of coarse-grained structures, atomic 
detail can be added using existing software such as those 
in Refs. I5lll52l . Once minimized and relaxed, these (now) 
atomically detailed structures form an ad hoc ensem- 
ble which may be of immediate use in dockin g 53 ' 54 and 
homology modeling applications. Further, in principle, 
such structures can be re-weighted into the Boltzmann 
distribution^. 



In the long term, one can imagine a day when struc- 
tural databases will be based not on single (static) 
structures but rather will collect ensembles — as envi- 
sioned in the authors' scheme for an "Ensemble Protein 



Database" (http: //www. epdb.pitt.edu/ ) 
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